Week 13 : Review and Final Exam
The University of Sydney
🗓 Date: Monday 24 November 2025
🕒 Starting Time: 5:00pm
📍 Location: Check Your Exam Timetable
This presentation is based on the SOLES reveal.js Quarto template and is licensed under a Creative Commons Attribution 4.0 International License.
Multiple Choice Section (9 questions, 18 marks)
Extended Answer Section (7 questions, 42 marks)
In summary, the final exam accounts for 50% of the unit’s total mark.
Each question has exactly two correct answers; choose all correct answers for each question. Each question is worth two marks, awarded only if your answer is completely correct. The total mark for this section is 18.
• You can select at most two options for each question. Otherwise, you automatically get 0 for the question.
• For every correct response that is selected, a mark is awarded. For every incorrect response that is selected, a mark is deducted.
• A mark for a question cannot be negative even if only incorrect responses are selected.
• Non-selected responses do not modify any awarded marks. There are no marks awarded for not answering a question.
Your answers must be entered on the Multiple Choice Answer Sheet.
\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
These metrics focus on how closely predicted values align with actual values.
Mean Squared Error (MSE):
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Residual Sum of Squares (RSS):
\text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
R-Squared: Measures the proportion of variance in the target variable explained by a linear model, providing an overall measure of goodness-of-fit.
R^2=\frac{\sum_{i=1}^n(y_i-\bar y)^2-\sum_{i=1}^n(y_i-\hat y_i)^2}{\sum_{i=1}^n(y_i-\bar y)^2}
Adjusted R-Squared: adjusted for the number of predictors; used for model selection.
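The regression metrics above are easy to compute directly from predictions. A minimal sketch (in Python with made-up numbers for illustration; the unit's own code examples are in R):

```python
import numpy as np

# Toy data: actual vs predicted values (hypothetical numbers)
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

n = len(y)
rss = np.sum((y - y_hat) ** 2)        # Residual Sum of Squares
mse = rss / n                         # Mean Squared Error
tss = np.sum((y - y.mean()) ** 2)     # Total Sum of Squares
r2 = (tss - rss) / tss                # R-squared
p = 1                                 # number of predictors (assumed)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # Adjusted R-squared
```

Note how adjusted R-squared is always at most R-squared and penalises extra predictors, which is why it is preferred for model selection.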
For classification, metrics assess how well a model correctly classifies categorical outcomes.
\text{Accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total Samples}}
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}
\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}
\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision + Recall}}
Cohen’s Kappa: \kappa = \frac{p_o - p_e}{1 - p_e}
Area Under the ROC Curve (AUC-ROC): summarises classification performance across all decision thresholds; 1 indicates perfect separation, 0.5 indicates random guessing.
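All of these classification metrics follow from the four confusion-matrix counts. A minimal sketch (Python, with hypothetical counts; the unit's code is in R):

```python
# Confusion-matrix counts for a binary classifier (hypothetical)
tp, tn, fp, fn = 40, 45, 5, 10
total = tp + tn + fp + fn

accuracy = (tp + tn) / total
recall = tp / (tp + fn)              # sensitivity
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

# Cohen's kappa: observed agreement vs agreement expected by chance
p_o = accuracy
p_yes = ((tp + fp) / total) * ((tp + fn) / total)  # chance agreement on positive
p_no = ((fn + tn) / total) * ((fp + tn) / total)   # chance agreement on negative
p_e = p_yes + p_no
kappa = (p_o - p_e) / (1 - p_e)
```

With these counts, accuracy is 0.85 but kappa is lower (0.7), illustrating how kappa discounts agreement that would occur by chance.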
Question: When dealing with imbalanced classes in a classification problem, which performance metrics are most effective for evaluating model performance?
\begin{align*} Y = \beta_0 + \beta_1 X_1 + \cdots+ \beta_p X_p + \varepsilon \end{align*}
Find coefficients that minimise the sum of squared residuals
\begin{align*} \min_{\boldsymbol{\beta}}& \sum_{i = 1}^n (Y_i - \beta_0 - \sum_{j = 1}^p\beta_j X_{ij})^2 \qquad \text{subject to}\qquad \sum_{j = 1}^p |\beta_j|\le s.\\ \min_{\boldsymbol{\beta}}& \sum_{i = 1}^n (Y_i - \beta_0 - \sum_{j = 1}^p\beta_j X_{ij})^2 \qquad \text{subject to}\qquad \sum_{j = 1}^p \beta_j^2\le s. \end{align*}
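The two constrained problems above (lasso with the L1 constraint, ridge with the L2 constraint) are usually fit in their equivalent penalised form. A minimal sketch using scikit-learn in Python with simulated data (the data and penalty value here are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter in the (hypothetical) true model
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty -> some coefficients exactly zero
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty -> coefficients shrunk, none exactly zero
```

Note that scikit-learn's `alpha` is the penalty strength (the Lagrange multiplier of the constraint), which plays a different role from the lasso/ridge mixing parameter also called alpha in R's glmnet.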
Which function in R fits lasso or ridge regression? If you want to fit a lasso model, what value should you set for $\alpha$ in that function?
Repeated k-fold cross-validation repeats the k-fold CV process multiple times, each with a different random split. This helps to:
- provide a less biased CV test error estimate;
- provide the variance of the CV error.
It comes with a computational cost.
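Repeated k-fold CV can be sketched as follows (Python/scikit-learn with simulated data for illustration; the same idea applies to the unit's R workflow):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# 5-fold CV repeated 10 times, each repeat with a different random split
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=1)
scores = -cross_val_score(LinearRegression(), X, y, cv=rkf,
                          scoring="neg_mean_squared_error")

# 5 folds x 10 repeats = 50 fold-level MSE estimates: the mean is the
# CV error estimate, and the spread shows its variability
cv_error, cv_error_sd = scores.mean(), scores.std()
```

The computational cost is clear from the counts: 50 model fits instead of 5 for a single 5-fold CV.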
\begin{align*} p_k(x) = \color{red}{P(Y = k| X = x)} = \frac{\color{blue}{\pi_k} f_k(x)}{\sum_{\ell = 1}^K\pi_\ell f_\ell(x)} \end{align*}
Posterior: the probability of classifying an observation to group k, given that it has features x
Prior: the prior probability of an observation in general belonging to group k
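Bayes' rule above can be evaluated numerically for a small example. A sketch with two groups, hypothetical priors, and normal class densities (Python; priors, means, and the query point are made up):

```python
import numpy as np
from scipy.stats import norm

# Two groups with prior probabilities pi_k and normal class densities f_k
priors = np.array([0.7, 0.3])                    # pi_1, pi_2 (hypothetical)
means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

x = 1.5
densities = norm.pdf(x, loc=means, scale=sds)    # f_k(x)
# Posterior p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x)
posterior = priors * densities / np.sum(priors * densities)
```

Even with a prior of only 0.3, group 2 wins at x = 1.5 because its density there is much larger: the posterior combines prior and likelihood.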
How can the out-of-sample (test) error be estimated when using a bagging model?
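The standard answer is the out-of-bag (OOB) error: each tree in a bagged ensemble is fit on a bootstrap sample, so the observations left out of that sample act as a built-in test set for that tree. A minimal sketch (Python/scikit-learn on a bundled dataset; the unit's own code is in R):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True asks the forest to evaluate each observation using only
# the trees that did NOT see it in their bootstrap sample
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

oob_error = 1 - rf.oob_score_   # OOB estimate of the test error
```

No separate validation split or cross-validation loop is needed, which is the main appeal of the OOB estimate for bagging and random forests.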
What are the key hyper-parameters that need to be tuned when fitting gradient boosting? How do these hyper-parameters affect the bias-variance trade-off in the model performance?
\begin{align*} \mathbb{E}[g(X)] = \int g(t) f(t)\, dt \approx \frac{1}{N} \sum_{i = 1}^N g(X_i) \end{align*}
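The Monte Carlo approximation above can be checked on a case with a known answer. For g(x) = x² and X ~ N(0, 1), E[g(X)] = 1; a sketch (Python, sample size chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[g(X)] for g(x) = x^2 with X ~ N(0, 1); the true value is 1
samples = rng.normal(size=100_000)
estimate = np.mean(samples ** 2)
```

The error of the estimate shrinks at rate 1/sqrt(N), so with N = 100,000 draws the estimate should sit within a fraction of a percent of 1.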
A typical model in this case is
\begin{align*} Y_i = f(x_i) + \varepsilon_i \end{align*}
- The function f is some smooth (differentiable) function
\begin{align*} f(x_1, x_2, \ldots, x_n|\boldsymbol{\theta}) \end{align*}
\begin{align*} L(\boldsymbol{\theta}|\boldsymbol{x}) = \prod_{i = 1}^n f(x_i |\boldsymbol{\theta}) \leadsto \ell(\boldsymbol{\theta}|\boldsymbol{x}) = \log L(\boldsymbol{\theta}|\boldsymbol{x}) = \sum_{i = 1}^n \log f(x_i |\boldsymbol{\theta}) \end{align*}
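Maximising the log-likelihood above can be done numerically. A sketch for an exponential model with density f(x|θ) = exp(-x/θ)/θ, whose MLE has the known closed form θ̂ = sample mean (Python; the true parameter and sample size are made up):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)   # simulated data, true theta = 2

# Negative log-likelihood: -l(theta|x) = n log(theta) + sum(x) / theta
def neg_loglik(theta):
    return len(x) * np.log(theta) + np.sum(x) / theta

# Minimising the negative log-likelihood = maximising the likelihood
res = minimize_scalar(neg_loglik, bounds=(0.01, 10), method="bounded")
theta_hat = res.x
```

The numerical optimum agrees with the analytic MLE (the sample mean), a useful sanity check when a closed form exists.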
What is a good practice to avoid overfitting?
- What is overfitting?
- If a model is overfitting, what does this imply about the bias-variance trade-off in the model performance?
A. Use a complicated model that includes all possible interaction terms and higher order terms of the covariates.
B. Use a two-part loss function which includes a regulariser to penalize model complexity.
C. Use a good optimizer to minimize error on training data.
D. Use cross-validation to monitor the generalisation performance.
Which of the following statements about linear discriminant analysis are correct?
A. The assumptions in linear discriminant analysis are that the features in each of the groups are a sample from an arbitrary multivariate distribution, and all of the populations have the same mean vector.
B. The assumptions in linear discriminant analysis are that the features in each of the groups are a sample from a multivariate normal distribution, and all of the populations have the same covariance matrix.
C. Linear discriminant analysis directly models the probability of the label given the features.
D. Linear discriminant analysis requires features to be numeric.
Which of the following are supervised learning techniques?
A. K-means clustering
B. Random Forest
C. Linear Discriminant Analysis
D. Density estimation
Which of the following are characteristics of a kernel function (as used in density estimation)?
A. a frequency function from a histogram
B. a symmetric function
C. a function ranging from -1 to 1
D. a function that integrates to 1 over its support
Which of the following practices may overestimate the test performance?
A. Using PCA to construct new independent features from the original features.
B. Imputing missing values using the mean calculated from the entire dataset.
C. Using 10-fold cross-validation to assess model performance.
D. To address class imbalance, reporting Cohen’s kappa from the test set as the overall performance metric.
Which of the following statements about support vector machines (SVMs) are correct?
A. SVM aims to find the hyperplane that maximises the margin between different classes.
B. SVMs can only be used for linear classification problems.
C. Changes in the position of the support vectors will not impact the decision boundary.
D. Increasing the value of C in the SVM’s optimisation function will lead to an increase in the bias, but the model will generalise better to unseen data.
Which of the following are indirect measures of the test error?
C_p = \frac{1}{n} \left(\text{RSS} + 2d \widehat{\sigma}^2\right)
\text{RSS} = \sum_{i = 1}^n (Y_i - \widehat Y_i)^2
\text{BIC} = \frac{1}{n}\left( \text{RSS} + \log(n) d \widehat{\sigma}^2 \right)
F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
where in the above:
- \widehat Y_i is the predicted response for the ith observation;
- d is the number of features in the model, not including the intercept;
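Mallows' C_p and BIC are direct plug-in calculations once a model is fit. A sketch using hypothetical fit quantities (n, d, RSS, and an error-variance estimate are all made up):

```python
import numpy as np

# Hypothetical fit summary: n observations, d features, RSS, and an
# estimate of the error variance sigma^2 (e.g. from the full model)
n, d = 100, 5
rss = 250.0
sigma2_hat = 2.5

cp = (rss + 2 * d * sigma2_hat) / n            # Mallows' C_p
bic = (rss + np.log(n) * d * sigma2_hat) / n   # BIC (as defined above)
```

Because log(n) > 2 whenever n > 7, BIC penalises each extra feature more heavily than C_p, so it tends to select smaller models.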
Your friend recently started an internship as a data analyst at a university student support unit. The team is interested in building a model to predict whether a student is at risk of dropping out, using available academic and engagement data collected from the learning management system.
Your friend explains the dataset consists of 5,000 observations, where each data point is represented as (\mathbf{x}_i, y_i) for i = 1, \dots, 5000. \mathbf{x}_i contains 80 features, including number of logins, average time spent per week on the platform, assignment grades, and forum participation. y_i = 1 means the student eventually dropped out, while y_i = 0 means the student successfully completed the semester. Around 2% of students in the dataset dropped out.
Here’s the modeling workflow your friend followed:
They noticed a few missing values in some features and filled them using the mean of each variable.
Then, they applied feature selection on the imputed full dataset, selecting the top 20 features most correlated with the target variable.
Next, they randomly split the data into 75% training and 25% testing sets.
Finally, they trained an SVM classifier and evaluated it using test set accuracy, achieving 92% accuracy.
Based on your understanding of statistical learning and good modeling practice, identify three problematic aspects of your friend’s modeling workflow. For each issue: briefly explain why it is problematic, and suggest an alternative.
You are planning your retirement and decide that you will retire with $1,000,000 invested in an index fund. During retirement you plan to withdraw $50,000 each year from your investment, with the remaining money staying invested in the index fund. Assume the index fund has an average return rate of 9% and a standard deviation of 15% (normally distributed). Assume you retire at 65 and will live until you are 100, and your withdrawal is adjusted for CPI by a factor of 1.04 each year. Compute the probability that your investment will support your lifestyle until you die.
Your friend uses Monte Carlo to study your retirement plan and proposes the pseudo code to solve this problem. Evaluate this code and fill in <1> to <4>.
# Set initial parameters
initial_investment <- 1000000
annual_withdrawal <- 50000
mean_return <- 0.09
sd_return <- 0.15
cpi <- 1.04
years <- 35
n_sim <- <1>
success_count <- 0
# Start Monte Carlo simulation
for (sim in 1:n_sim) {
  investment <- initial_investment
  withdrawal <- annual_withdrawal
  for (year in 1:years) {
    # Simulate annual return from normal distribution
    annual_return <- random value from <2>
    # Update investment value
    investment <- investment * (1 + annual_return)
    investment <- investment - withdrawal
    # Check if investment is depleted
    if (investment <= <3>) {
      break
    }
    # Adjust withdrawal for inflation
    withdrawal <- withdrawal * cpi
  }
  # Count simulation as success if funds last to age 100
  if (investment > 0) {
    success_count <- success_count + 1
  }
}
# Estimate and print success probability
success_probability <- <4>
print(success_probability)

Write down an expression for <1> to <4> respectively.
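For checking your own answer, here is one possible completion of the pseudocode, translated to Python: <1> a chosen number of simulations, <2> a draw from Normal(mean_return, sd_return), <3> 0 (funds depleted), <4> success_count / n_sim. The simulation size here is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters from the question
initial_investment = 1_000_000
annual_withdrawal = 50_000
mean_return, sd_return = 0.09, 0.15
cpi, years = 1.04, 35
n_sim = 10_000                                          # <1>

success_count = 0
for _ in range(n_sim):
    investment = initial_investment
    withdrawal = annual_withdrawal
    for _ in range(years):
        annual_return = rng.normal(mean_return, sd_return)  # <2>
        investment = investment * (1 + annual_return) - withdrawal
        if investment <= 0:                             # <3> funds depleted
            break
        withdrawal *= cpi                               # inflation-adjust withdrawal
    if investment > 0:
        success_count += 1

success_probability = success_count / n_sim             # <4>
```

The estimate is itself a Monte Carlo approximation, so it varies between runs; increasing n_sim shrinks that variability at rate 1/sqrt(n_sim).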